Alzheimer’s disease is a neurological disorder characterized by the degeneration of brain cells, and it accounts for roughly 60% of all dementia cases. The disease manifests itself through decreased cognitive capabilities and a reduced ability to function independently (Breijyeh, Karaman, 2020). In the early stages, Alzheimer’s involves severe memory loss, apathy, and depression, while later stages are marked by communication problems, behavioral changes, and difficulties with walking and speaking (Alzheimer’s Association 2024 Alzheimer’s Disease Facts and Figures, 2024). Alzheimer’s disease is widespread and deadly. With over 7 million people in the United States living with Alzheimer’s, it is the fifth leading cause of death among people aged 65 and older (Alzheimer’s Association 2024 Alzheimer’s Disease Facts and Figures, 2024).
Its prevalence and severity make Alzheimer’s disease a critical subject of study. Given the complex nature of neurodegenerative disease, early symptoms often go unnoticed by patients themselves, making Alzheimer’s difficult to diagnose until significant cognitive decline has occurred (Breijyeh, Karaman, 2020). Because the disease is difficult both for medical professionals to recognize and for patients to notice, it is often a deadly threat. Thus, the primary objective of this research project is to identify the key determinants of Alzheimer’s disease. Uncovering such factors is vital, as no effective medicine has been found yet. Therefore, early detection of Alzheimer’s disease and enhanced preventive measures are considered crucial in minimizing the impact of dementia.
The goal of alleviating and potentially preventing Alzheimer’s disease can be supported by data science and predictive modeling. More specifically, the key research question, "What are the determinants and potential predictors of Alzheimer’s disease?", can be addressed by applying predictive models that may help to identify key features associated with dementia.
The Alzheimer’s Disease Dataset has been acquired from Kaggle. The author, Rabie El Kharoua, created this dataset to offer extensive insights into the factors associated with Alzheimer’s disease. The variables include: the diagnosis of Alzheimer’s disease, demographic data of patients (age, gender, ethnicity, education level), lifestyle factors (e.g. BMI, smoking habits, alcohol consumption), medical history (e.g. cardiovascular disease, diabetes), clinical measurements (e.g. cholesterol levels), cognitive and functional assessments (e.g. MMSE, ADL), and symptoms (e.g. confusion, personality changes, forgetfulness). These variables are highly relevant to this study because they are commonly associated with and tracked in people with Alzheimer’s (Breijyeh, Karaman, 2020), and so can be used in this research to determine which of them best predict Alzheimer’s.
This study aims to answer the research question "What are the determinants and potential predictors of Alzheimer’s disease?" in several phases. After data preprocessing (including handling missing data and identifying outliers), exploratory data analysis will be conducted to gain insights into the dataset’s main characteristics. Then, by developing predictive models such as logistic regression, a Naive Bayes classifier, and k-nearest neighbors (kNN), key features associated with Alzheimer’s disease can be identified. The findings extracted from the data analysis will contribute to enhancing early detection and developing recommendations for mitigating the impact of Alzheimer’s disease.
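To make the modeling phase concrete, the following is a minimal sketch of one of the named models, logistic regression, fitted with base R's glm(). It runs on simulated data, not on the Kaggle dataset: the data frame `toy`, the assumed effect of MMSE on the diagnosis, and the 80/20 split are all illustrative assumptions rather than the project's actual setup or results.

```r
set.seed(42)
n <- 500

# Simulated stand-ins for two dataset columns (NOT the real data)
toy <- data.frame(
  MMSE = runif(n, min = 0, max = 30),
  Age  = sample(60:90, n, replace = TRUE)
)

# Assumed relationship, for illustration only:
# a lower MMSE score raises the odds of a positive diagnosis
toy$Diagnosis <- rbinom(n, size = 1, prob = plogis(2 - 0.2 * toy$MMSE))

# Hypothetical 80/20 train/test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression: the sign and size of each coefficient
# indicate the direction and strength of the association
fit <- glm(Diagnosis ~ MMSE + Age, data = train, family = binomial)
print(summary(fit)$coefficients)

# Classify held-out rows with a 0.5 probability cut-off
pred_prob  <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy   <- mean(pred_class == test$Diagnosis)
print(accuracy)
```

In a sketch like this, predictors whose coefficients are clearly different from zero are the candidates for "key features"; the same train/test split could then be reused for the Naive Bayes and kNN models.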
To complement the analysis, the following R libraries will be used throughout the research:
library(ggplot2) - A package for creating customizable data visualisations.
library(plyr) - A toolkit for splitting and combining data.
library(Hmisc) - Toolkit used for data analysis, in this project mainly used for managing missing values.
library(naniar) - Toolkit used for visualizing missing values.
library(liver) - Tools for splitting data sets and other metrics.
library(naivebayes) - Implements the ‘Naive-Bayes’ classification algorithm.
library(pROC) - Used to visualize the performance of classification models by creating a ROC curve and calculating the AUC.
library(psych) - A toolkit for psychometric and correlation analysis.
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'liver'
## The following object is masked from 'package:base':
##
## transform
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## naivebayes 1.0.0 loaded
## For more information please visit:
## https://majkamichal.github.io/naivebayes/
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Our research adopts an exploratory approach to understand the predictors of Alzheimer’s diagnosis. Thus, the aim is not only to assess the immediate relationships between these variables and Alzheimer’s diagnosis but also to uncover patterns that could inform future research, diagnosis, and potential preventive measures.
Given the complexity of Alzheimer’s disease and the multifaceted nature of its risk factors, we will formulate hypotheses at a group level, rather than focusing on individual predictors. The hypotheses will be exploratory in nature, allowing for an investigation into how demographic, lifestyle, and medical factors, among others, might collectively contribute to the onset of Alzheimer’s. By grouping predictors, we hope to identify clusters of risk factors and/or individual factors that might provide insight into the mechanisms behind Alzheimer’s development. By understanding which factors are associated with an increased or decreased likelihood of Alzheimer’s diagnosis, we can potentially improve early detection strategies and guide interventions aimed at reducing risk.
Alzheimer’s disease is widely regarded as a multifactorial condition, meaning that its risk is influenced by a combination of different factors, including genetic, environmental and lifestyle factors (Breijyeh, Karaman, 2020). Despite extensive research, mechanisms that cause the pathological changes related to Alzheimer’s disease remain unknown (Breijyeh, Karaman, 2020). Since the underlying mechanism remains elusive, focusing on modifiable risk factors like lifestyle is crucial for mitigating disease progression in the absence of a cure.
Aging is one of the most prominent demographic risk factors for Alzheimer’s disease. Research shows that it is highly uncommon for young individuals to develop Alzheimer’s, with the vast majority of cases occurring in individuals 65 and above (Breijyeh, Karaman, 2020). This makes age a crucial determinant in understanding the onset and progression of Alzheimer’s, highlighting the need for targeted screening and preventive measures in older populations. Thus, the following hypothesis is constructed.
H1: Older age increases the likelihood of Alzheimer’s diagnosis.
Lifestyle factors also contribute to Alzheimer’s disease risk. For example, exposure to air pollution has been linked to increased production of peptides commonly associated with decreased cognitive function (Breijyeh, Karaman, 2020). Moreover, saturated fatty acids and high-calorie diets have been found to lead to an increased incidence of Alzheimer’s, with malnutrition also exacerbating the condition (Breijyeh, Karaman, 2020). This highlights the significance of lifestyle choices in the progression of the disease, and so our hypothesis is as follows.
H2: A healthier lifestyle, characterized by a lower BMI, non-smoking status, low alcohol consumption, regular physical activity, good diet quality, and better sleep quality, is associated with a lower likelihood of Alzheimer’s diagnosis.
Medical history is another important risk factor for Alzheimer’s. Cardiovascular diseases, diabetes, and obesity have all been linked to Alzheimer’s risk. For example, while obesity does not directly cause Alzheimer’s, it puts individuals at higher risk for cancer or cardiovascular disease, indirectly increasing the risk of Alzheimer’s disease (Breijyeh, Karaman, 2020). The factors within medical history that can be controlled (i.e., obesity) are especially important because they offer avenues for intervention. Thus, the following hypothesis will be tested.
H3: A history of chronic health conditions such as cardiovascular disease, diabetes, depression, hypertension, and head injury, as well as a family history of Alzheimer’s, increases the likelihood of an Alzheimer’s diagnosis.
Clinical measurements such as blood pressure, cholesterol levels, and cognitive and functional assessments provide additional insights into Alzheimer’s risk. Cholesterol, for instance, can contribute to the development of Alzheimer’s by accumulating in the brain tissue (Breijyeh, Karaman, 2020). Furthermore, those who have Alzheimer’s or are suspected of having it often undergo mental and physical examinations, such as the MMSE and functional assessments, to evaluate their cognitive and functional capabilities. Patients typically perform worse on these assessments compared to those without Alzheimer’s, reflecting the progressive nature of the disease (Guk-Hee et al., 2004).
H4: Poor cardiovascular health, indicated by high blood pressure and unfavorable cholesterol levels (high total cholesterol, high LDL, low HDL, and high triglycerides), is associated with a higher likelihood of Alzheimer’s diagnosis.
H5: Lower cognitive and functional scores (e.g., MMSE, functional assessment) and the presence of memory complaints or behavioral problems are associated with a higher likelihood of Alzheimer’s diagnosis.
Alzheimer’s symptoms, manifesting in behavioral issues, memory loss, and disorientation, are often noticed by others before the patients are aware of their cognitive decline. Therefore, early symptom identification is critical, as it prompts further diagnostic evaluations (Breijyeh & Karaman, 2020).
H6: The presence of cognitive and behavioral symptoms such as confusion, disorientation, personality changes, difficulty completing tasks, and forgetfulness is positively associated with Alzheimer’s diagnosis.
We import the csv file in R as follows.
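A minimal sketch of such an import with base R's read.csv() is given here. The file name `alzheimers_disease_data.csv` is an assumption (replace it with the path of the downloaded Kaggle file), and a two-row stand-in file is written to a temporary directory only so that the example runs on its own.

```r
# Hypothetical file name; point this at the downloaded Kaggle file instead
csv_path <- file.path(tempdir(), "alzheimers_disease_data.csv")

# Two-row stand-in file so the sketch is self-contained (not the real data)
writeLines(c(
  "PatientID,Age,Gender,Diagnosis,DoctorInCharge",
  "4751,73,0,0,XXXConfid",
  "4752,89,0,0,XXXConfid"
), csv_path)

# Read the CSV into a data frame; text columns stay as character vectors
data_demo <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compact overview: one line per column with its type and first values
str(data_demo)
```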
To get an overview of the dataset in R, we use the
str() function as follows:
## 'data.frame': 2149 obs. of 35 variables:
## $ PatientID : int 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 ...
## $ Age : int 73 89 73 74 89 86 68 75 72 87 ...
## $ Gender : int 0 0 0 1 0 1 0 0 1 0 ...
## $ Ethnicity : int 0 0 3 0 0 1 3 0 1 0 ...
## $ EducationLevel : int 2 0 1 1 0 1 2 1 0 0 ...
## $ BMI : num 22.9 26.8 17.8 33.8 20.7 ...
## $ Smoking : int 0 0 0 1 0 0 1 0 0 1 ...
## $ AlcoholConsumption : num 13.3 4.54 19.56 12.21 18.45 ...
## $ PhysicalActivity : num 6.33 7.62 7.84 8.43 6.31 ...
## $ DietQuality : num 1.347 0.519 1.826 7.436 0.795 ...
## $ SleepQuality : num 9.03 7.15 9.67 8.39 5.6 ...
## $ FamilyHistoryAlzheimers : int 0 0 1 0 0 0 0 0 0 0 ...
## $ CardiovascularDisease : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Diabetes : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Depression : int 1 0 0 0 0 0 0 0 0 0 ...
## $ HeadInjury : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Hypertension : int 0 0 0 0 0 0 0 0 1 0 ...
## $ SystolicBP : int 142 115 99 118 94 168 143 117 117 130 ...
## $ DiastolicBP : int 72 64 116 115 117 62 88 63 119 78 ...
## $ CholesterolTotal : num 242 231 284 160 238 ...
## $ CholesterolLDL : num 56.2 193.4 153.3 65.4 92.9 ...
## $ CholesterolHDL : num 33.7 79 69.8 68.5 56.9 ...
## $ CholesterolTriglycerides : num 162.2 294.6 83.6 277.6 291.2 ...
## $ MMSE : num 21.46 20.61 7.36 13.99 13.52 ...
## $ FunctionalAssessment : num 6.52 7.12 5.9 8.97 6.05 ...
## $ MemoryComplaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BehavioralProblems : int 0 0 0 1 0 0 0 0 1 1 ...
## $ ADL : num 1.7259 2.5924 7.1195 6.4812 0.0147 ...
## $ Confusion : int 0 0 0 0 0 1 0 1 0 0 ...
## $ Disorientation : int 0 0 1 0 0 0 0 0 0 0 ...
## $ PersonalityChanges : int 0 0 0 0 1 0 0 0 1 0 ...
## $ DifficultyCompletingTasks: int 1 0 1 0 1 0 0 0 0 0 ...
## $ Forgetfulness : int 0 1 0 0 0 0 1 1 0 0 ...
## $ Diagnosis : int 0 0 0 0 0 0 0 1 0 0 ...
## $ DoctorInCharge : chr "XXXConfid" "XXXConfid" "XXXConfid" "XXXConfid" ...
From the output of str(data), it can be seen that we
have 2149 observations of 35 variables. As the variable PatientID only
identifies each observation, it will not be included in
the statistical and exploratory data analyses. However, we keep it
as an identifier. As the variable DoctorInCharge contains
confidential information about the doctor responsible for each patient
and does not contribute to the statistical analysis, it will be excluded
from our analysis too. Out of the 33 variables that will be actively
used in the model, Diagnosis will be the target variable
for the main hypothesis of the research. The definition and meaning
of each variable are as follows:
PatientID: (Categorical-Nominal) A unique identifier assigned to each patient (4751 to 6899).
Age: (Numerical-Discrete) The age of the patients ranges from 60 to 90 years.
Gender: (Categorical-Binary) Gender of the patients, where 0 represents Male and 1 represents Female.
Ethnicity: (Categorical-Nominal) The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other.
EducationLevel: (Categorical-Nominal) The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor’s, 3: Higher.
BMI: (Numerical-Continuous) Body Mass Index of the patients, ranging from 15 to 40.
Smoking: (Categorical-Binary) Smoking status, where 0 indicates No and 1 indicates Yes.
AlcoholConsumption: (Numerical-Continuous) Weekly alcohol consumption in units, ranging from 0 to 20.
PhysicalActivity: (Numerical-Continuous) Weekly physical activity in hours, ranging from 0 to 10.
DietQuality: (Numerical-Continuous) Diet quality score, ranging from 0 to 10.
SleepQuality: (Numerical-Continuous) Sleep quality score, ranging from 4 to 10.
FamilyHistoryAlzheimers: (Categorical-Binary) Family history of Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes.
CardiovascularDisease: (Categorical-Binary) Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes.
Diabetes: (Categorical-Binary) Presence of diabetes, where 0 indicates No and 1 indicates Yes.
Depression: (Categorical-Binary) Presence of depression, where 0 indicates No and 1 indicates Yes.
HeadInjury: (Categorical-Binary) History of head injury, where 0 indicates No and 1 indicates Yes.
Hypertension: (Categorical-Binary) Presence of hypertension, where 0 indicates No and 1 indicates Yes.
SystolicBP: (Numerical-Discrete) Systolic blood pressure, ranging from 90 to 180 mmHg.
DiastolicBP: (Numerical-Discrete) Diastolic blood pressure, ranging from 60 to 120 mmHg.
CholesterolTotal: (Numerical-Continuous) Total cholesterol levels, ranging from 150 to 300 mg/dL.
CholesterolLDL: (Numerical-Continuous) Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.
CholesterolHDL: (Numerical-Continuous) High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.
CholesterolTriglycerides: (Numerical-Continuous) Triglycerides levels, ranging from 50 to 400 mg/dL.
MMSE: (Numerical-Continuous) Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment.
FunctionalAssessment: (Numerical-Continuous) Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment.
MemoryComplaints: (Categorical-Binary) Presence of memory complaints, where 0 indicates No and 1 indicates Yes.
BehavioralProblems: (Categorical-Binary) Presence of behavioral problems, where 0 indicates No and 1 indicates Yes.
ADL: (Numerical-Continuous) Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment.
Confusion: (Categorical-Binary) Presence of confusion, where 0 indicates No and 1 indicates Yes.
Disorientation: (Categorical-Binary) Presence of disorientation, where 0 indicates No and 1 indicates Yes.
PersonalityChanges: (Categorical-Binary) Presence of personality changes, where 0 indicates No and 1 indicates Yes.
DifficultyCompletingTasks: (Categorical-Binary) Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes.
Forgetfulness: (Categorical-Binary) Presence of forgetfulness, where 0 indicates No and 1 indicates Yes.
Diagnosis: (Categorical-Binary) Diagnosis status for Alzheimer’s Disease, where 0 indicates No and 1 indicates Yes.
With the interpretation of each variable in mind, the first rows
of the data frame are shown using the head() function
to give a clearer picture of the data.
## PatientID Age Gender Ethnicity EducationLevel BMI Smoking
## 1 4751 73 0 0 2 22.92775 0
## 2 4752 89 0 0 0 26.82768 0
## 3 4753 73 0 3 1 17.79588 0
## 4 4754 74 1 0 1 33.80082 1
## 5 4755 89 0 0 0 20.71697 0
## 6 4756 86 1 1 1 30.62689 0
## AlcoholConsumption PhysicalActivity DietQuality SleepQuality
## 1 13.297218 6.3271125 1.3472143 9.025679
## 2 4.542524 7.6198845 0.5187671 7.151293
## 3 19.555085 7.8449878 1.8263347 9.673574
## 4 12.209266 8.4280014 7.4356041 8.392554
## 5 18.454356 6.3104607 0.7954975 5.597238
## 6 4.140144 0.2110616 1.5849220 7.261953
## FamilyHistoryAlzheimers CardiovascularDisease Diabetes Depression HeadInjury
## 1 0 0 1 1 0
## 2 0 0 0 0 0
## 3 1 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 1 0 0
## Hypertension SystolicBP DiastolicBP CholesterolTotal CholesterolLDL
## 1 0 142 72 242.3668 56.15090
## 2 0 115 64 231.1626 193.40800
## 3 0 99 116 284.1819 153.32276
## 4 0 118 115 159.5822 65.36664
## 5 0 94 117 237.6022 92.86970
## 6 0 168 62 280.7125 198.33463
## CholesterolHDL CholesterolTriglycerides MMSE FunctionalAssessment
## 1 33.68256 162.18914 21.463532 6.518877
## 2 79.02848 294.63091 20.613267 7.118696
## 3 69.77229 83.63832 7.356249 5.895077
## 4 68.45749 277.57736 13.991127 8.965106
## 5 56.87430 291.19878 13.517609 6.045039
## 6 79.08050 263.94365 27.517529 5.510144
## MemoryComplaints BehavioralProblems ADL Confusion Disorientation
## 1 0 0 1.72588346 0 0
## 2 0 0 2.59242413 0 0
## 3 0 0 7.11954774 0 1
## 4 0 1 6.48122586 0 0
## 5 0 0 0.01469122 0 0
## 6 0 0 9.01568628 1 0
## PersonalityChanges DifficultyCompletingTasks Forgetfulness Diagnosis
## 1 0 1 0 0
## 2 0 0 1 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 1 1 0 0
## 6 0 0 0 0
## DoctorInCharge
## 1 XXXConfid
## 2 XXXConfid
## 3 XXXConfid
## 4 XXXConfid
## 5 XXXConfid
## 6 XXXConfid
Next, the summary() function is used to summarize the
data frame and the characteristics of the variables.
## PatientID Age Gender Ethnicity
## Min. :4751 Min. :60.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:5288 1st Qu.:67.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :5825 Median :75.00 Median :1.0000 Median :0.0000
## Mean :5825 Mean :74.91 Mean :0.5063 Mean :0.6975
## 3rd Qu.:6362 3rd Qu.:83.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :6899 Max. :90.00 Max. :1.0000 Max. :3.0000
## EducationLevel BMI Smoking AlcoholConsumption
## Min. :0.000 Min. :15.01 Min. :0.0000 Min. : 0.002003
## 1st Qu.:1.000 1st Qu.:21.61 1st Qu.:0.0000 1st Qu.: 5.139810
## Median :1.000 Median :27.82 Median :0.0000 Median : 9.934412
## Mean :1.287 Mean :27.66 Mean :0.2885 Mean :10.039442
## 3rd Qu.:2.000 3rd Qu.:33.87 3rd Qu.:1.0000 3rd Qu.:15.157931
## Max. :3.000 Max. :39.99 Max. :1.0000 Max. :19.989293
## PhysicalActivity DietQuality SleepQuality FamilyHistoryAlzheimers
## Min. :0.003616 Min. :0.009385 Min. : 4.003 Min. :0.0000
## 1st Qu.:2.570626 1st Qu.:2.458455 1st Qu.: 5.483 1st Qu.:0.0000
## Median :4.766424 Median :5.076087 Median : 7.116 Median :0.0000
## Mean :4.920202 Mean :4.993138 Mean : 7.051 Mean :0.2522
## 3rd Qu.:7.427899 3rd Qu.:7.558625 3rd Qu.: 8.563 3rd Qu.:1.0000
## Max. :9.987429 Max. :9.998346 Max. :10.000 Max. :1.0000
## CardiovascularDisease Diabetes Depression HeadInjury
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1443 Mean :0.1508 Mean :0.2006 Mean :0.0926
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Hypertension SystolicBP DiastolicBP CholesterolTotal
## Min. :0.0000 Min. : 90.0 Min. : 60.00 Min. :150.1
## 1st Qu.:0.0000 1st Qu.:112.0 1st Qu.: 74.00 1st Qu.:190.3
## Median :0.0000 Median :134.0 Median : 91.00 Median :225.1
## Mean :0.1489 Mean :134.3 Mean : 89.85 Mean :225.2
## 3rd Qu.:0.0000 3rd Qu.:157.0 3rd Qu.:105.00 3rd Qu.:262.0
## Max. :1.0000 Max. :179.0 Max. :119.00 Max. :300.0
## CholesterolLDL CholesterolHDL CholesterolTriglycerides MMSE
## Min. : 50.23 Min. :20.00 Min. : 50.41 Min. : 0.005312
## 1st Qu.: 87.20 1st Qu.:39.10 1st Qu.:137.58 1st Qu.: 7.167602
## Median :123.34 Median :59.77 Median :230.30 Median :14.441660
## Mean :124.34 Mean :59.46 Mean :228.28 Mean :14.755132
## 3rd Qu.:161.73 3rd Qu.:78.94 3rd Qu.:314.84 3rd Qu.:22.161028
## Max. :199.97 Max. :99.98 Max. :399.94 Max. :29.991381
## FunctionalAssessment MemoryComplaints BehavioralProblems ADL
## Min. :0.00046 Min. :0.000 Min. :0.0000 Min. : 0.001288
## 1st Qu.:2.56628 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 2.342836
## Median :5.09444 Median :0.000 Median :0.0000 Median : 5.038973
## Mean :5.08005 Mean :0.208 Mean :0.1568 Mean : 4.982958
## 3rd Qu.:7.54698 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.: 7.581490
## Max. :9.99647 Max. :1.000 Max. :1.0000 Max. : 9.999747
## Confusion Disorientation PersonalityChanges DifficultyCompletingTasks
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2052 Mean :0.1582 Mean :0.1508 Mean :0.1587
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Forgetfulness Diagnosis DoctorInCharge
## Min. :0.0000 Min. :0.0000 Length:2149
## 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :0.0000 Median :0.0000 Mode :character
## Mean :0.3015 Mean :0.3537
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
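Because the categorical variables are stored as integer codes, summary() reports means and quartiles for them (for example, a mean for Ethnicity), which are not meaningful for nominal codes. One option, shown here as a sketch and not necessarily the approach taken in this project, is to convert such columns to labelled factors using the coding from the variable list. The data frame `demo` is a hypothetical three-row stand-in, not the real data.

```r
# Toy rows using the codes from the variable list (not the real data)
demo <- data.frame(
  Gender         = c(0, 1, 1),
  Ethnicity      = c(0, 3, 2),
  EducationLevel = c(2, 0, 1),
  Diagnosis      = c(0, 1, 0)
)

# Map each integer code to its documented label
demo$Gender <- factor(demo$Gender, levels = c(0, 1),
                      labels = c("Male", "Female"))
demo$Ethnicity <- factor(demo$Ethnicity, levels = 0:3,
                         labels = c("Caucasian", "African American",
                                    "Asian", "Other"))
demo$EducationLevel <- factor(demo$EducationLevel, levels = 0:3,
                              labels = c("None", "High School",
                                         "Bachelor's", "Higher"))
demo$Diagnosis <- factor(demo$Diagnosis, levels = c(0, 1),
                         labels = c("No", "Yes"))

# For factors, summary() now reports counts per category
summary(demo)
```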
Before conducting the statistical analysis, we need to clean and prepare the data set. Namely, outliers need to be detected and adjusted accordingly, and missing values have to be handled. The data preprocessing stage is therefore essential for obtaining reliable trends in the subsequent analysis and prediction.
To check for missing values (NAs) in R, we plot the
missing values with the gg_miss_var() function:
The plot of missing values reveals that the data frame does not contain
any NAs. Thus, we can continue with the analysis.
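As a numeric cross-check alongside the gg_miss_var() plot, missing values can also be counted per column with base R. The sketch below is illustrative only: `demo` is a hypothetical toy data frame with two NAs inserted on purpose, whereas the real data frame would be expected to give zero for every column.

```r
# Toy data frame with two deliberately missing values (not the real data)
demo <- data.frame(
  Age       = c(73, 89, NA, 74),
  BMI       = c(22.9, 26.8, 17.8, NA),
  Diagnosis = c(0, 0, 1, 0)
)

# Number of NAs in each column
na_counts <- colSums(is.na(demo))
print(na_counts)

# TRUE only if the whole data frame is free of NAs
complete <- !anyNA(demo)
print(complete)
```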
Outlier detection will be performed on numerical variables. A combination of visual and statistical techniques will be employed to identify potential outliers in the dataset. Specifically, histograms will be used to provide a visual representation of the data distribution, allowing for the detection of unusually extreme values. In addition to this, the Interquartile Range (IQR) method will be applied to quantify the spread of the data, identifying any data points that fall outside the expected range by calculating the distance between the first and third quartiles. Potential outliers will not be removed as this can significantly alter the results of the analysis. Instead they will be replaced by random values. This approach will be implemented in the following analysis.
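The IQR rule described above flags a value as a potential outlier when it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. As a sketch under assumptions, the rule can be wrapped in a small helper and applied to every numeric column at once; the function name `iqr_outliers` and the toy data frame `demo` are hypothetical and not part of the project code.

```r
# Returns TRUE for values outside [Q1 - k * IQR, Q3 + k * IQR]
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  x < q[[1]] - k * spread | x > q[[2]] + k * spread
}

# Toy data (not the real data): one implausible blood pressure value
demo <- data.frame(
  Age        = c(60, 65, 70, 75, 80, 85, 90),
  SystolicBP = c(110, 115, 120, 125, 130, 135, 400)
)

# Count flagged values in every numeric column
numeric_cols   <- names(demo)[sapply(demo, is.numeric)]
outlier_counts <- sapply(demo[numeric_cols],
                         function(col) sum(iqr_outliers(col), na.rm = TRUE))
print(outlier_counts)
```

The per-variable checks that follow apply the same rule one column at a time.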
Q1 <- quantile(data$Age, 0.25)
Q3 <- quantile(data$Age, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$Age < lower_bound | data$Age > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the Age variable.
Q1 <- quantile(data$BMI, 0.25)
Q3 <- quantile(data$BMI, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$BMI < lower_bound | data$BMI > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the BMI variable.
ggplot(data = data, aes(x = AlcoholConsumption)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$AlcoholConsumption, 0.25)
Q3 <- quantile(data$AlcoholConsumption, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$AlcoholConsumption < lower_bound | data$AlcoholConsumption > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the Alcohol Consumption variable.
ggplot(data = data, aes(x = PhysicalActivity)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$PhysicalActivity, 0.25)
Q3 <- quantile(data$PhysicalActivity, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$PhysicalActivity < lower_bound | data$PhysicalActivity > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the Physical Activity variable.
ggplot(data = data, aes(x = DietQuality)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$DietQuality, 0.25)
Q3 <- quantile(data$DietQuality, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$DietQuality < lower_bound | data$DietQuality > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the Diet Quality variable.
ggplot(data = data, aes(x = SleepQuality)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$SleepQuality, 0.25)
Q3 <- quantile(data$SleepQuality, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$SleepQuality < lower_bound | data$SleepQuality > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the Sleep Quality variable.
ggplot(data = data, aes(x = SystolicBP)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$SystolicBP, 0.25)
Q3 <- quantile(data$SystolicBP, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$SystolicBP < lower_bound | data$SystolicBP > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers were detected in the Systolic BP variable.
ggplot(data = data, aes(x = DiastolicBP)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$DiastolicBP, 0.25)
Q3 <- quantile(data$DiastolicBP, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$DiastolicBP < lower_bound | data$DiastolicBP > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the Diastolic BP variable.
ggplot(data = data, aes(x = CholesterolTotal)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")
Q1 <- quantile(data$CholesterolTotal, 0.25)
Q3 <- quantile(data$CholesterolTotal, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$CholesterolTotal < lower_bound | data$CholesterolTotal > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the CholesterolTotal variable.
ggplot(data = data, aes(x = CholesterolLDL)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")
Q1 <- quantile(data$CholesterolLDL, 0.25)
Q3 <- quantile(data$CholesterolLDL, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$CholesterolLDL < lower_bound | data$CholesterolLDL > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the CholesterolLDL variable.
ggplot(data = data, aes(x = CholesterolHDL)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")
Q1 <- quantile(data$CholesterolHDL, 0.25)
Q3 <- quantile(data$CholesterolHDL, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$CholesterolHDL < lower_bound | data$CholesterolHDL > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the CholesterolHDL variable.
ggplot(data = data, aes(x = CholesterolTriglycerides)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")
Q1 <- quantile(data$CholesterolTriglycerides, 0.25)
Q3 <- quantile(data$CholesterolTriglycerides, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$CholesterolTriglycerides < lower_bound | data$CholesterolTriglycerides > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the CholesterolTriglycerides variable.
Q1 <- quantile(data$MMSE, 0.25)
Q3 <- quantile(data$MMSE, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$MMSE < lower_bound | data$MMSE > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the MMSE variable.
ggplot(data = data, aes(x = FunctionalAssessment)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")
Q1 <- quantile(data$FunctionalAssessment, 0.25)
Q3 <- quantile(data$FunctionalAssessment, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$FunctionalAssessment < lower_bound | data$FunctionalAssessment > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the Functional Assessment variable.
Q1 <- quantile(data$ADL, 0.25)
Q3 <- quantile(data$ADL, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data[data$ADL < lower_bound | data$ADL > upper_bound, ]
outliers
## [1] PatientID Age
## [3] Gender Ethnicity
## [5] EducationLevel BMI
## [7] Smoking AlcoholConsumption
## [9] PhysicalActivity DietQuality
## [11] SleepQuality FamilyHistoryAlzheimers
## [13] CardiovascularDisease Diabetes
## [15] Depression HeadInjury
## [17] Hypertension SystolicBP
## [19] DiastolicBP CholesterolTotal
## [21] CholesterolLDL CholesterolHDL
## [23] CholesterolTriglycerides MMSE
## [25] FunctionalAssessment MemoryComplaints
## [27] BehavioralProblems ADL
## [29] Confusion Disorientation
## [31] PersonalityChanges DifficultyCompletingTasks
## [33] Forgetfulness Diagnosis
## [35] DoctorInCharge
## <0 rows> (or 0-length row.names)
No outliers have been detected in the ADL variable.
As observed, the outlier detection process, which combined inspection of histograms and boxplots with the 1.5 * IQR rule, did not flag any observations among the numerical variables. Therefore, no additional outlier handling is required.
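The quartile-and-fence calculation repeated for each variable could be wrapped in a small helper and applied to all numeric columns at once. The sketch below is only an illustration: the function name iqr_outliers and the toy data frame are hypothetical and do not come from the analysis, and the toy values are made up so that one column contains an obvious outlier.

```r
# Hypothetical helper: returns the positions of values outside the
# Q1 - k * IQR and Q3 + k * IQR fences (k = 1.5 matches the rule used above).
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

# Toy data standing in for the real data frame (values are made up).
toy <- data.frame(
  SystolicBP = c(110, 118, 121, 125, 130, 134, 139, 142, 250),
  MMSE       = c(12, 15, 18, 20, 22, 24, 26, 28, 29)
)

# Number of flagged rows per numeric column.
outlier_counts <- sapply(toy, function(col) length(iqr_outliers(col)))
outlier_counts
```

Applied to the real data, a vector of zeros would correspond to the empty data frames printed above. Naming the spread something other than IQR also avoids masking the base function IQR().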
To deepen our understanding of the Alzheimer’s disease diagnosis and the variables that may affect it, we use Exploratory Data Analysis (EDA) to examine the relevant relationships in detail. We first inspect the relationships between variables visually through bar plots, histograms, box plots and density plots. We then conduct hypothesis tests to validate the visual inspection. To better understand the variables that affect our target variable, we begin by examining the Diagnosis variable itself.
As indicated by str(data) function, the Diagnosis
variable was mistakenly classified as an integer, even though it is a
binary variable. To correct this, the Diagnosis variable will be
converted to a factor type.
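The conversion itself is not shown in the report; a minimal sketch of one way to do it is given below. The toy data frame is hypothetical and only stands in for the real data, and the use of factor() with explicit levels is an assumption about how the step was carried out.

```r
# Toy stand-in for the Diagnosis column (values are made up).
toy <- data.frame(Diagnosis = c(0L, 1L, 0L, 0L, 1L))

# Convert the 0/1 integer codes to a factor with explicit levels,
# so that plots and tests treat Diagnosis as a two-group variable.
toy$Diagnosis <- factor(toy$Diagnosis, levels = c(0, 1))

str(toy$Diagnosis)
table(toy$Diagnosis)
```

Fixing the level order explicitly keeps 0 (no diagnosis) as the first level, which determines the order of bars, fill colours and test groups later on.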
First, we examine a simple numerical distribution of the data.
##    0    1
## 1389  760
## [1] 0.3536529
ggplot(data = data) +
geom_bar(aes(x = Diagnosis), fill = c("pink", "lightblue")) +
labs(title = "Bar plot for the target variable 'Diagnosis'")
Based on the
summary() function and the distribution of the
Diagnosis variable, we see that about 35% of patients in our data
have a diagnosis of Alzheimer’s disease.
Next, we explore the relations between the
Diagnosis variable and our potential predictors. For
categorical predictors, we use a contingency table along with two types
of bar plots: a standard bar plot and another with same-sized bars,
which allows for comparison of proportions among different
categories. For numerical variables, we use a boxplot as well as a
histogram (for discrete variables) or a density plot (for continuous
variables).
addmargins(table(data$Diagnosis, data$FamilyHistoryAlzheimers, dnn = c("Diagnosis of Alzheimer", "Family History: Alzheimers")))
##                       Family History: Alzheimers
## Diagnosis of Alzheimer    0    1  Sum
##   0                    1024  365 1389
##   1                     583  177  760
##   Sum                  1607  542 2149
ggplot(data = data) +
geom_bar(aes(x = as.factor(FamilyHistoryAlzheimers), fill = Diagnosis)) +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Family History Alzheimer's")
ggplot(data = data) +
geom_bar(aes(x = as.factor(FamilyHistoryAlzheimers), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Family History Alzheimer's")
It seems that people who did not have a family history of Alzheimer’s
might be slightly more likely to have Alzheimer’s than people who have a
family history of Alzheimer’s.
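The same-sized bar plot shows these shares only visually. As a rough check, they can be computed directly from the counts in the contingency table for family history. The sketch below re-enters those counts by hand; the object names family_tab and diagnosis_share are illustrative and not part of the original analysis.

```r
# Counts copied from the family-history contingency table
# (rows: Diagnosis 0/1, columns: FamilyHistoryAlzheimers 0/1).
family_tab <- matrix(
  c(1024, 583, 365, 177),
  nrow = 2,
  dimnames = list(Diagnosis = c("0", "1"), FamilyHistory = c("0", "1"))
)

# margin = 2 normalises within each column, i.e. within each
# family-history group, which is what position = "fill" displays.
diagnosis_share <- prop.table(family_tab, margin = 2)
round(diagnosis_share["1", ], 3)
```

This gives a diagnosis share of roughly 36% without a family history and roughly 33% with one, which is the small gap visible in the plot.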
addmargins(table(data$Diagnosis, data$CardiovascularDisease, dnn = c("Diagnosis of Alzheimer", "Cardiovascular Disease")))
##                       Cardiovascular Disease
## Diagnosis of Alzheimer    0    1  Sum
##   0                    1200  189 1389
##   1                     639  121  760
##   Sum                  1839  310 2149
ggplot(data = data) +
geom_bar(aes(x = as.factor(CardiovascularDisease), fill = Diagnosis)) +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Cardiovascular Disease")
ggplot(data = data) +
geom_bar(aes(x = as.factor(CardiovascularDisease), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Cardiovascular Disease")
Based on the standardized bar plots, people who have cardiovascular
disease seem to be slightly more likely to have Alzheimer’s.
##                       Diabetes
## Diagnosis of Alzheimer    0    1  Sum
##   0                    1168  221 1389
##   1                     657  103  760
##   Sum                  1825  324 2149
ggplot(data = data) +
geom_bar(aes(x = as.factor(Diabetes), fill = Diagnosis)) +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Diabetes")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Diabetes), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Diabetes")
People who have no diabetes are slightly more likely to have
Alzheimer’s.
##                       Depression
## Diagnosis of Alzheimer    0    1  Sum
##   0                    1108  281 1389
##   1                     610  150  760
##   Sum                  1718  431 2149
ggplot(data = data) +
geom_bar(aes(x = as.factor(Depression), fill = Diagnosis)) +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Depression")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Depression), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Depression")
Depression doesn’t seem to be indicative of whether someone has
Alzheimer or not.
addmargins(table(data$Diagnosis, data$HeadInjury, dnn = c("Diagnosis of Alzheimer", "Head Injury")))
##                       Head Injury
## Diagnosis of Alzheimer    0    1  Sum
##   0                    1254  135 1389
##   1                     696   64  760
##   Sum                  1950  199 2149
ggplot(data = data) +
geom_bar(aes(x = as.factor(HeadInjury), fill = Diagnosis)) +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Head Injury")
ggplot(data = data) +
geom_bar(aes(x = as.factor(HeadInjury), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Head Injury")
People who did not have a head injury are slightly more likely to be
diagnosed with Alzheimer’s.
addmargins(table(data$Diagnosis, data$Hypertension, dnn = c("Diagnosis of Alzheimer", "Hypertension")))
##                       Hypertension
## Diagnosis of Alzheimer    0    1  Sum
##   0                    1195  194 1389
##   1                     634  126  760
##   Sum                  1829  320 2149
ggplot(data = data) +
geom_bar(aes(x = as.factor(Hypertension), fill = Diagnosis)) +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Hypertension")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Hypertension), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Hypertension")
People who have hypertension are more likely to be diagnosed with
Alzheimer.
ggplot(data = data) +
geom_bar(aes(x = SystolicBP, fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue"))
ggplot(data = data) +
geom_bar(aes(x = SystolicBP, fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue"))
Density plots suggest that people who have Systolic BP in range ~110 to
~140 might be more likely to have Alzheimer.
ggplot(data = data) +
geom_bar(aes(x = DiastolicBP, fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("DiastolicBP")
ggplot(data = data) +
geom_bar(aes(x = DiastolicBP, fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("DiastolicBP")
The density curves for both groups largely overlap, indicating that the
diastolic levels are quite similar between individuals with and without
Alzheimer’s. The curve for people diagnosed with Alzheimer suggests that
people having lower (~63 to ~75) Diastolic BP might be slightly more
likely to be diagnosed with Alzheimer.
ggplot(data = data) +
geom_boxplot(aes(x = Diagnosis, y = CholesterolTotal), fill = c("pink", "lightblue"))
The medians of cholesterol are basically the same as seen in the box
plot. The density curves for both groups largely overlap, indicating
that the cholesterol levels are quite similar between individuals with
and without Alzheimer’s. Cholesterol total doesn’t seem to be a
predictor of Alzheimer.
ggplot(data = data) +
geom_boxplot(aes(x = Diagnosis, y = CholesterolLDL), fill = c("pink", "lightblue"))
The medians of cholesterol LDL are very similar as seen in the box plot.
The density curves for both groups largely overlap, indicating that the
cholesterol LDL levels are quite similar between individuals with and
without Alzheimer’s. There is a slight peak in Alzheimer group, which
may suggest that people having cholesterol LDL of 50 to 90 may be more
likely to be diagnosed with Alzheimer.
ggplot(data = data) +
geom_boxplot(aes(x = Diagnosis, y = CholesterolHDL), fill = c("pink", "lightblue"))
The box plot and density plot suggest that people diagnosed with
Alzheimer are more likely to have higher levels of cholesterol HDL.
Thus, cholesterol HDL might be a signficant predictor of ALzheimer.
ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = CholesterolTriglycerides), fill = c("pink", "lightblue"))
ggplot(data = data) +
  geom_density(aes(x = CholesterolTriglycerides, fill = Diagnosis), alpha = 0.3)
The box plot suggests that people with Alzheimer’s have a higher median
level of cholesterol triglycerides than people without Alzheimer’s;
however, the upper and lower quartiles are nearly the same. The density plot
suggests that people with cholesterol triglycerides levels of ~230 to
~310 might be more likely to have Alzheimer’s.
The MMSE scores are widely spread. Even so, people with Alzheimer’s tend
to have lower MMSE scores: the median, 25th, and 75th percentiles are all
lower than those of their non-Alzheimer’s counterparts. People without
Alzheimer’s are more likely to have a higher MMSE (> 23), while people
with Alzheimer’s are more likely to have a lower MMSE. Thus, MMSE might
be a useful predictor.
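The percentile comparison read off the box plot can also be tabulated per diagnosis group. The following sketch uses a small deterministic toy data set, not the study data, purely to show the mechanics; the object names are hypothetical.

```r
# Deterministic toy scores for illustration only; not the study data.
toy <- data.frame(
  Diagnosis = factor(rep(c(0, 1), each = 21)),
  MMSE = c(seq(10, 30, length.out = 21), seq(0, 25, length.out = 21))
)

# 25th, 50th and 75th percentiles of MMSE within each diagnosis group.
mmse_quartiles <- sapply(
  split(toy$MMSE, toy$Diagnosis),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
mmse_quartiles
```

Each column of the result corresponds to one diagnosis group, so a group whose three percentiles are all lower shows up as a column that is smaller row by row.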
ggplot(data = data) +
geom_boxplot(aes(x = Diagnosis, y = FunctionalAssessment), fill = c("pink", "lightblue"))
The boxplot indicates that functional assessment is a very useful
variable given the scores are significantly lower in those with
Alzheimer’s than in those without. The functional assessment
demonstrates a very clear trend where people who score low on Functional
Assessment are significantly more likely to have Alzheimer’s, and people
who score high on Functional Assessment are less likely to have
Alzheimer’s. Thus, Functional Assessment might be an important predictor
of Alzheimer.
##       Diagnosis
## Memory    0    1
##   0    1228  474
##   1     161  286
ggplot(data = data) +
  geom_bar(aes(x = as.factor(MemoryComplaints), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Memory Complaints")
ggplot(data = data) +
geom_bar(aes(x = as.factor(MemoryComplaints), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Memory Complaints")
It appears only 286/760 or roughly 38% of diagnosed Alzheimer’s patients
complained about memory. This is still considerably higher than the 12%
(161/1389) of people who were not diagnosed but still complained about
memory. The bar plots demonstrate that having complaints about memory is
a solid indicator of a higher chance of Alzheimer’s compared with
those who did not complain about memory.
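The 38% and 12% figures are shares of complainers within each diagnosis group. The claim about a higher chance of Alzheimer’s among complainers rests on the reverse conditional, the share of diagnosed patients within each complaint group. Both can be computed from the memory complaints table; the sketch below re-enters those counts by hand, and the object names are illustrative.

```r
# Counts copied from the memory complaints table
# (rows: MemoryComplaints 0/1, columns: Diagnosis 0/1).
memory_tab <- matrix(
  c(1228, 161, 474, 286),
  nrow = 2,
  dimnames = list(MemoryComplaints = c("0", "1"), Diagnosis = c("0", "1"))
)

# Share of complainers within each diagnosis group (161/1389 and 286/760).
complaint_given_diagnosis <- prop.table(memory_tab, margin = 2)["1", ]

# Share of diagnosed patients within each complaint group (474/1702 and 286/447).
diagnosis_given_complaint <- prop.table(memory_tab, margin = 1)[, "1"]

round(complaint_given_diagnosis, 3)
round(diagnosis_given_complaint, 3)
```

Among patients with memory complaints, roughly 64% (286/447) have a diagnosis, against roughly 28% (474/1702) of those without complaints.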
##                    Diagnosis
## Behavioral Problems    0    1
##   0                 1255  557
##   1                  134  203
ggplot(data = data) +
geom_bar(aes(x = as.factor(BehavioralProblems), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Behavioral Problems")
Most patients do not have behavioral problems. Behavioral problems
nevertheless serve as a solid indicator of Alzheimer’s, as people with
behavioral problems were more likely to have a diagnosis of
Alzheimer’s.
The ADL scores appear to be a little more concentrated around the tails.
ADL seems to be a very good indicator of Alzheimer’s because low scores
are more likely when having the diagnosis of Alzheimer.
##          Diagnosis
## Confusion    0    1
##   0       1096  612
##   1        293  148
ggplot(data = data) +
geom_bar(aes(x = as.factor(Confusion), fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Confusion")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Confusion), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Confusion")
Confusion serves as a poor indicator of Alzheimer’s, as the rates of
confusion among those who were and were not diagnosed with
Alzheimer’s are very similar.
##               Diagnosis
## Disorientation    0    1
##   0            1160  649
##   1             229  111
ggplot(data = data) +
geom_bar(aes(x = as.factor(Disorientation), fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Disorientation")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Disorientation), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Disorientation")
Disorientation seems to be quite uncommon with the vast majority of
patients not reporting it. Disorientation is another variable that
appears to be poor at predicting Alzheimer’s, given that the rate of the
disorientation problems is similar between Alzheimer group and
Non-Alzheimer group.
##                    Diagnosis
## Personality Changes    0    1
##   0                 1172  653
##   1                  217  107
ggplot(data = data) +
geom_bar(aes(x = as.factor(PersonalityChanges), fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Personality Changes")
ggplot(data = data) +
geom_bar(aes(x = as.factor(PersonalityChanges), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Personality Changes")
In general, reporting personality changes is very uncommon. Rates of
Personality Changes are very similar between two groups of diagnosis and
thus personality changes are rather a poor indicator of Alzheimer’s.
table(data$DifficultyCompletingTasks, data$Diagnosis, dnn = c("Difficulty Completing Tasks", "Diagnosis"))
##                            Diagnosis
## Difficulty Completing Tasks    0    1
##   0                         1172  636
##   1                          217  124
ggplot(data = data) +
  geom_bar(aes(x = as.factor(DifficultyCompletingTasks), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Difficulty Completing Tasks")
ggplot(data = data) +
geom_bar(aes(x = as.factor(DifficultyCompletingTasks), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Difficulty Completing Tasks")
Difficulty completing tasks was not very common in this data set. No
clear trend is visible. It seems that Difficulty completing tasks is not
a great indicator of Alzheimer’s diagnosis as rates of people having
difficulty with completing tasks are similar in both diagnosis
groups.
##              Diagnosis
## Forgetfulness    0    1
##   0            970  531
##   1            419  229
ggplot(data = data) +
geom_bar(aes(x = as.factor(Forgetfulness), fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Forgetfulness")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Forgetfulness), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Forgetfulness")
Forgetfulness is more common in both groups than previous variables. The
rates of forgetfulness are almost exactly the same between those with
and without Alzheimer’s. This means very little can be determined about
Alzheimer’s diagnosis from that symptom alone.
ggplot(data = data) +
geom_bar(aes(x = Age, fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Age")
ggplot(data = data) +
geom_bar(aes(x = Age, fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Age")
Age, in this dataset, doesn’t seem to be a significant predictor of
Alzheimer disease.
ggplot(data = data) +
geom_bar(aes(x = as.factor(Gender), fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Gender")
ggplot(data = data) +
geom_bar(aes(x = as.factor(Gender), fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Gender")
There is no significant gender difference in Alzheimer diagnosis.
data$EthnicityFactor <- factor(data$Ethnicity,
                               levels = c(0, 1, 2, 3),
                               labels = c("Caucasian", "African American", "Asian", "Other"))
ggplot(data = data) +
geom_bar(aes(x = EthnicityFactor, fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Ethnicity")
ggplot(data = data) +
geom_bar(aes(x = EthnicityFactor, fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Ethnicity")
Asians seem to be slightly more likely to be diagnosed with
Alzheimer’s.
data$EducationLevelFactor <- factor(data$EducationLevel,
                                    levels = c(0, 1, 2, 3),
                                    labels = c("None", "High School", "Bachelor's", "Higher"))
ggplot(data = data) +
geom_bar(aes(x = EducationLevelFactor, fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Education")
ggplot(data = data) +
geom_bar(aes(x = EducationLevelFactor, fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Education")
Individuals with higher levels of education are less likely to be
diagnosed with Alzheimer’s. Thus, education level seems to be an important
predictor.
The boxplot shows that individuals diagnosed with Alzheimer’s have slightly
higher BMI. The density plot reveals that individuals with BMI larger
than 33 are more likely to have Alzheimer’s.
ggplot(data = data) +
geom_bar(aes(x = SmokingFactor, fill = Diagnosis), position = "stack") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Smoking")
ggplot(data = data) +
geom_bar(aes(x = SmokingFactor, fill = Diagnosis), position = "fill") +
scale_fill_manual(values = c("pink", "lightblue")) +
xlab("Smoking")
Smoking doesn’t seem to be an important predictor of Alzheimer.
ggplot(data = data) +
geom_boxplot(aes(x = Diagnosis , y = AlcoholConsumption),
fill = c("pink", "lightblue"))
The median seems to be the same for people with Alzheimer’s and for people
without Alzheimer’s, thus alcohol consumption doesn’t seem to be an
important predictor of Alzheimer’s.
ggplot(data = data) +
geom_boxplot(aes(x = PhysicalActivity, y = Diagnosis),
fill = c("pink", "lightblue"))
The boxplot and density plot suggest that physical activity is
not a significant predictor of Alzheimer’s, as the density curves overlap a
lot and the median, 1st quartile and 3rd quartile are very similar.
ggplot(data = data) +
geom_boxplot(aes(x = DietQuality, y = Diagnosis), fill = c("pink", "lightblue"))
The density plot shows that there is a lot of overlap between both
groups, suggesting that diet quality is not a significant predictor.
ggplot(data = data) +
geom_boxplot(aes(x = SleepQuality, y = Diagnosis),
fill = c("pink", "lightblue"))
The boxplot suggests that individuals without Alzheimer have a better
sleep quality. The density plot confirms that, therefore individuals
with Alzheimer might be more likely to have worse sleep quality,
potentially being a good predictor of Alzheimer.
We analyzed six major areas of health information, each with several variables (32 in total): demographic details, medical history, clinical measurements, cognitive and functional assessments, symptoms, and diagnosis information. The diagnosis served as the target against which we compared all other variables, in order to determine which of them could best predict Alzheimer’s. Based on the EDA, the variables below can be considered potential predictors of Alzheimer’s disease.
Ethnicity - Asians appear slightly more likely than other ethnicities to be diagnosed with Alzheimer’s.
EducationLevel - The rate of Alzheimer’s decreases as the level of education increases, but the differences are not vast.
BMI - Individuals with Alzheimer’s have slightly higher BMIs than those without Alzheimer’s. A BMI above 32.5 appears to show the greatest difference between the diagnosis groups, but overall the difference is small.
FamilyHistoryAlzheimers - The data shows that people without a family history of Alzheimer’s had a slightly higher likelihood of being diagnosed with it.
CardiovascularDisease - Those with cardiovascular disease had a slightly higher rate of Alzheimer’s.
Diabetes - Diabetics appear slightly less likely to have Alzheimer’s, but the difference is not large.
HeadInjury - Those without head injuries were slightly more likely to have Alzheimer’s, but again, this difference is not very large.
Hypertension - People with hypertension were slightly more likely to have an Alzheimer’s diagnosis than those without hypertension.
SystolicBP - The density plot suggests people with a systolic BP between 110 and 140 may be slightly more likely to have an Alzheimer’s diagnosis.
DiastolicBP - The curves mirror each other very closely, but they also indicate that people with a diastolic BP of 63 to 75 are slightly more likely to have Alzheimer’s.
CholesterolHDL - The box plot and density plot indicate that people with higher levels of cholesterol HDL are more likely to have Alzheimer’s and vice versa.
CholesterolTriglycerides - People with Alzheimer’s have a higher median for cholesterol triglycerides, indicating that it may be a suitable predictor.
MMSE - People without Alzheimer’s are more likely to have a higher MMSE (> 23), while people with Alzheimer’s are more likely to have a lower MMSE, making it a very suitable predictor of Alzheimer’s.
FunctionalAssessment - The functional assessment scores of those with Alzheimer’s are much lower compared to people who have not been diagnosed, making it a good predictor of Alzheimer’s.
MemoryComplaints - Only 12% of non-Alzheimer’s patients complained of memory problems, in stark contrast to the 38% of Alzheimer’s patients who complained, so there is a marked difference between the two groups.
BehavioralProblems - People with behavioral problems were far more likely to have Alzheimer’s than people without behavioral problems, making it a suitable indicator.
ADL - ADL scores are very low in people with Alzheimer’s and much higher in those without, making it a very accurate predictor of Alzheimer’s.
SleepQuality - Individuals with Alzheimer’s are more likely to have worse sleep quality.
In order to validate the predictors chosen by graphical inspection, we conduct hypothesis tests on these variables.
##
## Pearson's Chi-squared test
##
## data: table(data$Diagnosis, data$Ethnicity)
## X-squared = 6.3021, df = 3, p-value = 0.0978
The p-value (0.0978) is greater than the alpha level (0.05), thus we don’t have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses among the different ethnic groups.
##
## Pearson's Chi-squared test
##
## data: table(data$Diagnosis, data$EducationLevel)
## X-squared = 4.4531, df = 3, p-value = 0.2165
The p-value (0.2165) is greater than the alpha level (0.05), thus we don’t have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses among the different education levels.
##
## Welch Two Sample t-test
##
## data: BMI by Diagnosis
## t = -1.2148, df = 1537.9, p-value = 0.2246
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.0395624 0.2444058
## sample estimates:
## mean in group 0 mean in group 1
## 27.51509 27.91267
The p-value (0.2246) is greater than the alpha level (0.05), so we do not reject the null hypothesis. There is no significant difference in mean BMI between people diagnosed with Alzheimer’s and people without Alzheimer’s.
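The header "BMI by Diagnosis" in the output indicates that the formula interface of t.test() was used. A minimal sketch of such a call and of the comparison against the alpha level is given below. It runs on a small made-up data set, so its numbers are unrelated to the results reported here, and the object names are hypothetical.

```r
# Deterministic toy data for illustration only; not the study data.
toy <- data.frame(
  Diagnosis = factor(rep(c(0, 1), each = 6)),
  BMI = c(22, 24, 26, 27, 29, 31, 23, 25, 27, 28, 30, 33)
)

# t.test() applies the Welch correction by default (var.equal = FALSE),
# so equal variances in the two groups are not assumed.
bmi_test <- t.test(BMI ~ Diagnosis, data = toy)

alpha <- 0.05
reject_h0 <- bmi_test$p.value < alpha  # TRUE would mean rejecting H0
reject_h0
```

The same pattern applies to the other continuous variables, with the column name on the left-hand side of the formula exchanged.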
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(data$FamilyHistoryAlzheimers, data$Diagnosis)
## X-squared = 2.1703, df = 1, p-value = 0.1407
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.08340223 0.01096316
## sample estimates:
## prop 1 prop 2
## 0.6372122 0.6734317
The p-value (0.1407) is greater than the alpha level (0.05), thus we don’t have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of family history of Alzheimer’s between people who have Alzheimer’s and people who don’t have Alzheimer’s.
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(data$CardiovascularDisease, data$Diagnosis)
## X-squared = 1.9477, df = 1, p-value = 0.1628
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01753589 0.10323815
## sample estimates:
## prop 1 prop 2
## 0.6525285 0.6096774
The p-value (0.1628) is greater than the alpha level (0.05), thus we don’t have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of cardiovascular disease between people who have Alzheimer’s and people who don’t have Alzheimer’s.
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(data$Diabetes, data$Diagnosis)
## X-squared = 1.9532, df = 1, p-value = 0.1622
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.09919617 0.01499864
## sample estimates:
## prop 1 prop 2
## 0.6400000 0.6820988
The p-value (0.1622) is greater than the alpha level (0.05), thus we don’t have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of diabetes between people who have Alzheimer’s and people who don’t have Alzheimer’s.
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(data$HeadInjury, data$Diagnosis)
## X-squared = 0.83677, df = 1, p-value = 0.3603
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.10637604 0.03574597
## sample estimates:
## prop 1 prop 2
## 0.6430769 0.6783920
Since the p-value (0.3603) is not smaller than 0.05, we do not have enough evidence to reject H0. There is no significant difference in the proportion of head injury between people who have Alzheimer’s and people who do not.
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(data$Hypertension, data$Diagnosis)
## X-squared = 2.4425, df = 1, p-value = 0.1181
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01252733 0.10675232
## sample estimates:
## prop 1 prop 2
## 0.6533625 0.6062500
Since the p-value (0.1181) is not smaller than 0.05, we do not have enough evidence of a significant difference, so we do not reject H0. There is no significant difference in the proportion of hypertension between people who have Alzheimer’s and people who do not.
##
## Welch Two Sample t-test
##
## data: SystolicBP by Diagnosis
## t = 0.7235, df = 1560.4, p-value = 0.4695
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.449881 3.144540
## sample estimates:
## mean in group 0 mean in group 1
## 134.5644 133.7171
Since the p-value (0.4695) is larger than 0.05, we do not have enough evidence of a significant difference, so we do not reject H0. There is no significant difference in mean systolic BP between people who have Alzheimer’s and people who do not.
##
## Welch Two Sample t-test
##
## data: DiastolicBP by Diagnosis
## t = -0.24612, df = 1577.4, p-value = 0.8056
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.746453 1.357040
## sample estimates:
## mean in group 0 mean in group 1
## 89.77898 89.97368
Since the p-value (0.8056) is larger than 0.05, we do not have enough evidence of a significant difference, so we do not reject H0. There is no significant difference in mean diastolic BP between people who have Alzheimer’s and people who do not.
##
## Welch Two Sample t-test
##
## data: CholesterolHDL by Diagnosis
## t = -1.9706, df = 1551.2, p-value = 0.04895
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -4.111498327 -0.009497238
## sample estimates:
## mean in group 0 mean in group 1
## 58.73483 60.79533
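Since the p-value (0.04895) is less than 0.05, we reject H0. The difference in mean CholesterolHDL between the two groups is statistically significant, although only marginally so; the mean is slightly higher in the Alzheimer’s group (60.80 vs. 58.73).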
##
## Welch Two Sample t-test
##
## data: CholesterolTriglycerides by Diagnosis
## t = -1.0502, df = 1558.6, p-value = 0.2938
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -13.866075 4.195806
## sample estimates:
## mean in group 0 mean in group 1
## 226.5715 231.4067
Since the p-value (0.2938) is larger than 0.05, we do not reject H0. Thus, the difference in mean CholesterolTriglycerides between the two groups is not statistically significant.
##
## Welch Two Sample t-test
##
## data: MMSE by Diagnosis
## t = 12.025, df = 1851.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 3.574302 4.967469
## sample estimates:
## mean in group 0 mean in group 1
## 16.26554 11.99466
MMSE scores were significantly lower in patients with Alzheimer’s (mean 11.99 vs. 16.27), and the difference is statistically significant (p-value < 2.2e-16). Thus, MMSE might be a useful predictor of Alzheimer’s.
##
## Welch Two Sample t-test
##
## data: FunctionalAssessment by Diagnosis
## t = 18.552, df = 1660.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 1.973921 2.440657
## sample estimates:
## mean in group 0 mean in group 1
## 5.860669 3.653380
Functional Assessment scores were also significantly lower in patients with Alzheimer’s (mean 3.65 vs. 5.86; p-value < 2.2e-16). Thus, Functional Assessment might be a useful predictor of Alzheimer’s.
##
## Welch Two Sample t-test
##
## data: MemoryComplaints by Diagnosis
## t = -13.305, df = 1129.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.2988062 -0.2220039
## sample estimates:
## mean in group 0 mean in group 1
## 0.1159107 0.3763158
Memory complaints were more common in patients with Alzheimer’s (37.6% vs. 11.6%), and the difference is statistically significant (p-value < 2.2e-16). Thus, memory complaints might be a useful predictor of Alzheimer’s.
##
## Welch Two Sample t-test
##
## data: BehavioralProblems by Diagnosis
## t = -9.528, df = 1136.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.2057706 -0.1354954
## sample estimates:
## mean in group 0 mean in group 1
## 0.09647228 0.26710526
Behavioral problems were significantly more common in patients with Alzheimer’s (26.7% vs. 9.6%; p-value < 2.2e-16). Thus, behavioral problems might be useful in predicting an Alzheimer’s diagnosis.
##
## Welch Two Sample t-test
##
## data: ADL by Diagnosis
## t = 16.546, df = 1622.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 1.807000 2.293026
## sample estimates:
## mean in group 0 mean in group 1
## 5.707951 3.657938
ADL scores were significantly lower in patients with Alzheimer’s (mean 3.66 vs. 5.71; p-value < 2.2e-16). ADL might be a useful predictor of Alzheimer’s diagnosis.
##
## Welch Two Sample t-test
##
## data: SleepQuality by Diagnosis
## t = 2.6282, df = 1567.7, p-value = 0.008669
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.05289963 0.36417980
## sample estimates:
## mean in group 0 mean in group 1
## 7.124832 6.916292
Sleep quality was lower in patients with Alzheimer’s (mean 6.92 vs. 7.12), and the difference is statistically significant, as the p-value of 0.008669 is below the alpha level (0.05). Sleep quality might also be a useful predictor of Alzheimer’s diagnosis.
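The calls that produced the test outputs above are not shown. A minimal sketch of how such comparisons might be run with base R, using a small simulated data frame in place of the real dataset (the data frame `sim`, its values and the seed are assumptions for illustration only):

```r
# Simulated stand-in for the real dataset (values are made up for illustration)
set.seed(1)
n <- 400
sim <- data.frame(
  Diagnosis    = rbinom(n, size = 1, prob = 0.35),  # 1 = Alzheimer's, 0 = no Alzheimer's
  Hypertension = rbinom(n, size = 1, prob = 0.15)   # binary health indicator
)
# Continuous score whose mean is slightly lower in the diagnosed group
sim$SleepQuality <- rnorm(n, mean = 7.1 - 0.3 * sim$Diagnosis, sd = 1.7)

# Welch two-sample t-test: compares the mean of a numeric variable across groups
t_res <- t.test(SleepQuality ~ Diagnosis, data = sim)

# Two-sample test of proportions on a 2x2 table; with a table as input,
# prop.test() compares the share of the first column (Diagnosis = 0) across rows
p_res <- prop.test(table(sim$Hypertension, sim$Diagnosis))

# Decision rule used throughout this section: reject H0 when p < alpha
alpha <- 0.05
c(t_test = t_res$p.value < alpha, prop_test = p_res$p.value < alpha)
```

For binary variables such as MemoryComplaints, the group mean equals the proportion of 1s, so prop.test() would be a natural alternative to the t-test used in the outputs above.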
Variables that appear to be useful predictors of Alzheimer’s based on hypothesis testing:
Cholesterol HDL, MMSE, Functional Assessment, Memory Complaints, Behavioral Problems, ADL, Sleep Quality
In this stage, the dataset is prepared for the modeling section. The dataset is partitioned randomly into two groups: a train set (80%) and a test set (20%). The partition() function from the liver package is used, with a random seed set beforehand.
set.seed(5)
data_sets = partition(data = data, prob = c(0.8, 0.2))
train_set_A = data_sets$part1
test_set_A = data_sets$part2
actual_test_A = test_set_A$Diagnosis
Since the target variable Diagnosis is binary, we validate the partition by checking whether the proportion of Diagnosis = 1 differs between the train and test sets. For this we use a two-sample z-test for proportions with a significance level of 0.05. The hypotheses are stated as follows:
Null Hypothesis (\(H_0\)): There is no significant difference between the proportions of the two groups. \[ H_0: p_{\text{Diagnosis, trainset}} = p_{\text{Diagnosis, testset}} \]
Alternative Hypothesis (\(H_1\)): There is a significant difference between the proportions of the two groups. \[ H_1: p_{\text{Diagnosis, trainset}} \neq p_{\text{Diagnosis, testset}} \]
x1 = sum(train_set_A$Diagnosis == 1)
x2 = sum(test_set_A$Diagnosis == 1)
n1 = nrow(train_set_A)
n2 = nrow(test_set_A)
prop.test(x = c(x1, x2), n = c(n1, n2))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x1, x2) out of c(n1, n2)
## X-squared = 2.5685, df = 1, p-value = 0.109
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.094033560 0.009761076
## sample estimates:
## prop 1 prop 2
## 0.3448884 0.3870246
Since the p-value (0.109) is greater than the alpha level (0.05), we do not reject the null hypothesis: there is not enough evidence of a statistically significant difference in the proportion of Diagnosis = 1 between the train and test sets. We may therefore proceed with further analysis, as the two partitions appear comparable with respect to the target variable.
Considering the theoretical framework and the outputs of the Exploratory Data Analysis (EDA), the following of the 32 predictors in the dataset were found to be associated with the target variable and are deemed important: CholesterolHDL, MMSE, FunctionalAssessment, MemoryComplaints, BehavioralProblems, ADL and SleepQuality. Given these variables and the data partitioning, different algorithms will be used to model their effect on Diagnosis. For this purpose, the following formula is created:
formula = Diagnosis ~ CholesterolHDL + MMSE + FunctionalAssessment + MemoryComplaints + BehavioralProblems + ADL + SleepQuality
First, we apply the kNN algorithm. Second, we explore Naive Bayes classification. Lastly, we fit a logistic regression.
The optimal value of k will be chosen based on the error rate, using
the function kNN.plot().
kNN.plot(formula, train = train_set_A, test = test_set_A, transform = 'minmax',
         k.max = 30, set.seed = 7)
The optimal value of k appears to be k = 25, as it yields the lowest error rate.
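The transform = 'minmax' argument rescales the predictors before distances are computed, so that variables on large scales (such as CholesterolHDL) do not dominate those on small scales (such as the binary MemoryComplaints). A minimal base-R sketch of min-max scaling follows; the helper name minmax_scale and the example values are assumptions for illustration and are not part of the liver package, and scaling the test values with the training range is a common convention rather than a documented detail of kNN.plot().

```r
# Hypothetical helper: rescale x to [0, 1] using the range of a reference vector
minmax_scale <- function(x, ref = x) {
  (x - min(ref)) / (max(ref) - min(ref))
}

# Made-up example values for a predictor measured on a wide scale
hdl_train <- c(20, 45, 60, 100)
hdl_test  <- c(30, 80)

scaled_train <- minmax_scale(hdl_train)
# Test values are scaled with the training range so no test information is used
scaled_test  <- minmax_scale(hdl_test, ref = hdl_train)

scaled_train
scaled_test
```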
The Naive Bayes Classifier model will be created by calling the
naive_bayes() function in R.
train_set_A$Diagnosis <- as.factor(train_set_A$Diagnosis)
test_set_A$Diagnosis <- as.factor(test_set_A$Diagnosis)
naive_bayes = naive_bayes(formula, data = train_set_A)
naive_bayes
##
## ================================= Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = formula, data = train_set_A)
##
## --------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## --------------------------------------------------------------------------------
##
## A priori probabilities:
##
## 0 1
## 0.6551116 0.3448884
##
## --------------------------------------------------------------------------------
##
## Tables:
##
## --------------------------------------------------------------------------------
## :: CholesterolHDL (Gaussian)
## --------------------------------------------------------------------------------
##
## CholesterolHDL 0 1
## mean 58.78509 60.32494
## sd 23.22257 23.11797
##
## --------------------------------------------------------------------------------
## :: MMSE (Gaussian)
## --------------------------------------------------------------------------------
##
## MMSE 0 1
## mean 16.027019 11.946981
## sd 8.921653 7.213687
##
## --------------------------------------------------------------------------------
## :: FunctionalAssessment (Gaussian)
## --------------------------------------------------------------------------------
##
## FunctionalAssessment 0 1
## mean 5.931812 3.689126
## sd 2.793780 2.613157
##
## --------------------------------------------------------------------------------
## :: MemoryComplaints (Gaussian)
## --------------------------------------------------------------------------------
##
## MemoryComplaints 0 1
## mean 0.1192825 0.3611584
## sd 0.3242661 0.4807460
##
## --------------------------------------------------------------------------------
## :: BehavioralProblems (Gaussian)
## --------------------------------------------------------------------------------
##
## BehavioralProblems 0 1
## mean 0.09596413 0.26575809
## sd 0.29467421 0.44211279
##
## --------------------------------------------------------------------------------
##
## # ... and 2 more tables
##
## --------------------------------------------------------------------------------
##
## ================================= Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = formula, data = train_set_A)
## - Laplace: 0
## - Classes: 2
## - Samples: 1702
## - Features: 7
## - Conditional distributions:
## - Gaussian: 7
## - Prior probabilities:
## - 0: 0.6551
## - 1: 0.3449
##
## --------------------------------------------------------------------------------
In order to conduct a logistic regression analysis, the following model will be used:
\[ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7)}} \]
The model is created by calling the glm() function in R, and the
summary() function is used to examine the coefficients and their
significance.
##
## Call:
## glm(formula = formula, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.973338 0.370435 10.726 <2e-16 ***
## CholesterolHDL 0.004878 0.002708 1.802 0.0716 .
## MMSE -0.107115 0.008081 -13.256 <2e-16 ***
## FunctionalAssessment -0.445320 0.026087 -17.071 <2e-16 ***
## MemoryComplaints 2.586858 0.165096 15.669 <2e-16 ***
## BehavioralProblems 2.464977 0.180858 13.629 <2e-16 ***
## ADL -0.414363 0.025554 -16.215 <2e-16 ***
## SleepQuality -0.056527 0.035619 -1.587 0.1125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2792.3 on 2148 degrees of freedom
## Residual deviance: 1602.4 on 2141 degrees of freedom
## AIC: 1618.4
##
## Number of Fisher Scoring iterations: 6
Based on the output, MMSE, FunctionalAssessment, MemoryComplaints, BehavioralProblems and ADL are significant at alpha = 0.001, while CholesterolHDL is significant only at the alpha level of 0.1 (p = 0.0716). SleepQuality has a high p-value (0.1125) and is not significant, so removing this variable could be considered.
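Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. A short sketch using the estimates printed in the summary above (the vector name coefs is an assumption; the values are copied from the output):

```r
# Coefficient estimates copied from the summary() output above
coefs <- c(
  CholesterolHDL       =  0.004878,
  MMSE                 = -0.107115,
  FunctionalAssessment = -0.445320,
  MemoryComplaints     =  2.586858,
  BehavioralProblems   =  2.464977,
  ADL                  = -0.414363,
  SleepQuality         = -0.056527
)

# exp(beta) is the multiplicative change in the odds of Diagnosis = 1
# for a one-unit increase in the predictor, holding the others fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

Values below 1 (MMSE, FunctionalAssessment, ADL) indicate lower odds of diagnosis as the score increases, while values above 1 (MemoryComplaints, BehavioralProblems) indicate higher odds when the complaint or problem is present.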
After the model is created, it is important to check whether it
satisfies the assumptions of logistic regression. Here,
multicollinearity is tested with
car::vif()
## CholesterolHDL MMSE FunctionalAssessment
## 1.004962 1.192945 1.295570
## MemoryComplaints BehavioralProblems ADL
## 1.252231 1.258978 1.300856
## SleepQuality
## 1.001082
None of the VIF values exceeds 5 (all are below 1.31), indicating no problematic multicollinearity. Thus, this assumption is not violated.
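For reference, the variance inflation factor of predictor j is classically defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The base-R sketch below illustrates that definition on simulated data. The helper vif_manual and the simulated predictors are assumptions for illustration; car::vif() applied to a glm object works from the estimated coefficient covariance matrix, so its values need not match this formula exactly.

```r
# Hypothetical helper: classical VIF via auxiliary linear regressions
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    # Regress predictor v on all other predictors
    fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors: x3 is built from x1, so both should show inflated VIFs
set.seed(42)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- x1 + rnorm(500, sd = 0.3)
vifs <- vif_manual(data.frame(x1 = x1, x2 = x2, x3 = x3))
round(vifs, 2)
```

In this toy example the correlated pair (x1, x3) exceeds the threshold of 5 while the independent x2 stays close to 1, which is the pattern the rule of thumb is meant to flag.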
Because the target variable is binary, the predictive models are evaluated using the confusion matrix, the ROC curve and the AUC.
predict_knn_25_trans = kNN(formula, train = train_set_A, test = test_set_A, transform = "minmax", k = 25)
conf.mat.plot(predict_knn_25_trans, actual_test_A)
This confusion matrix shows that the algorithm and the chosen predictor variables are quite effective in predicting Alzheimer’s. Sensitivity is equal to 152/173 = 0.879, meaning that 87.9% of the patients actually diagnosed with Alzheimer’s were correctly identified by the model. Specificity equals 255/274 = 0.931, meaning that the model correctly identifies 93.1% of the negative class (no Alzheimer’s diagnosis). This suggests that the model predicts both classes accurately.
prob_knn = kNN(formula, train = train_set_A, test = test_set_A, transform = "minmax", k = 25, type = "prob")[, 1]
roc_knn = roc(actual_test_A, prob_knn)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
ggroc(roc_knn, size = 0.8) +
theme_minimal() +
ggtitle(paste("ROC plot for kNN; AUC =", round(auc(roc_knn), 3))) +
theme(legend.title = element_blank()) +
theme(legend.position = "inside", text = element_text(size = 17))
The ROC curve and the AUC of 0.936 suggest that kNN performs well in
predicting the test data.
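The AUC can be read as the probability that a randomly chosen case is ranked as more case-like than a randomly chosen control. A minimal base-R sketch of that interpretation, independent of the roc() call above (the helper auc_pairwise and the toy scores are assumptions for illustration):

```r
# Hypothetical helper: AUC as the share of (case, control) pairs ranked correctly.
# 'score' is assumed to be higher for cases; ties count as half.
auc_pairwise <- function(score, label) {
  cases    <- score[label == 1]
  controls <- score[label == 0]
  diffs <- outer(cases, controls, "-")   # every case score minus every control score
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Toy example: predicted probability of Diagnosis = 1 for six people
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.1)
label <- c(1,   1,   1,   0,   0,   0)
auc_toy <- auc_pairwise(score, label)
auc_toy
```

In the code of this section the kNN and Naive Bayes scores appear to be probabilities of class 0, which would explain why roc() reports the direction controls > cases; the AUC itself does not depend on that orientation.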
prob_naive_bayes = predict(naive_bayes, test_set_A, type = "prob")[, 1]
conf.mat(prob_naive_bayes, actual_test_A, cutoff = 0.5, reference = "0")
##        Actual
## Predict   0   1
##       0 242  48
##       1  32 125
There are 125 true positives and 242 true negatives, alongside 48 false
negatives and 32 false positives. This gives a sensitivity of
125/173 = 0.72 and a specificity of 242/274 = 0.88. The predictive
accuracy of the model is therefore relatively good, although it is
already apparent that it is worse than that of kNN.
prob_naive_bayes = predict(naive_bayes, test_set_A, type = "prob")[, 1]
roc_naive_bayes = roc(actual_test_A, prob_naive_bayes)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
ggroc(roc_naive_bayes, size = 0.8) +
theme_minimal() +
ggtitle(paste("ROC plot for Naive Bayes; AUC =", round(auc(roc_naive_bayes), 3))) +
theme(legend.title = element_blank()) +
theme(legend.position = "inside", text = element_text(size = 17))
The ROC curve and the AUC of 0.934 suggest that the Naive Bayes Classifier
performs well.
prob_logreg <- predict(logreg, test_set_A, type = "response")
conf.mat.plot(prob_logreg, actual_test_A, cutoff = 0.5, reference = "1")
This confusion matrix shows that the algorithm and the chosen
predictor variables are fairly effective in predicting Alzheimer’s.
Sensitivity is equal to 135/173 = 0.78, meaning that 78% of the patients
actually diagnosed with Alzheimer’s were correctly identified by the model.
Specificity equals 249/274 = 0.91, meaning that the model correctly
identifies 91% of the negative class (no Alzheimer’s diagnosis). This
suggests that the model is rather accurate in its predictions.
prob_logreg <- predict(logreg, test_set_A, type = "response")
roc_logreg_1 = roc(actual_test_A, prob_logreg)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(roc_logreg_1, size = 0.8) +
theme_minimal() +
ggtitle(paste("ROC plot for Logistic Regression; AUC =", round(auc(roc_logreg_1), 3))) +
theme(legend.title = element_blank()) +
theme(legend.position = "inside", text = element_text(size = 17))
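The ROC curve and the AUC of 0.919 suggest that Logistic Regression
performs well in this context. Since the Call line of the summary output
shows the model fitted with data = data rather than train_set_A, this
estimate may be somewhat optimistic.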
ggroc(list(roc_naive_bayes, roc_knn, roc_logreg_1), size = 0.8) +
theme_minimal() +
ggtitle("ROC plots with their AUC values") +
scale_color_manual(values = 1:3,
labels = c(
paste("Bayes; AUC=", round(auc(roc_naive_bayes), 3)),
paste("KNN; AUC=", round(auc(roc_knn), 3)),
paste("Logistic Regression; AUC=", round(auc(roc_logreg_1), 3))
)) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.7, 0.3), text = element_text(size = 17))
The AUC values range from 0.919 to 0.936, indicating that all three models have very good discriminatory power. kNN slightly outperforms the other models, with an AUC of 0.936.
Moreover, comparing the confusion matrices of the three algorithms (kNN, Naive Bayes Classifier and Logistic Regression) shows that kNN is the most accurate, with a sensitivity of 87.9% and a specificity of 93.1%. The other two algorithms, although still good, did not predict Alzheimer’s as accurately, so kNN is the best-performing model for predicting Alzheimer’s in this case.
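The sensitivity and specificity figures quoted above can be recomputed from the confusion-matrix counts reported in this section. A short sketch follows; the helper class_metrics is an assumption, and the counts are taken from the text and output above, with false negatives obtained as 173 minus the true positives and false positives as 274 minus the true negatives.

```r
# Hypothetical helper: standard metrics from confusion-matrix counts
class_metrics <- function(tp, fn, tn, fp) {
  c(sensitivity = tp / (tp + fn),          # share of actual positives detected
    specificity = tn / (tn + fp),          # share of actual negatives detected
    accuracy    = (tp + tn) / (tp + fn + tn + fp))
}

# Counts on the test set (173 positives, 274 negatives)
results <- rbind(
  kNN        = class_metrics(tp = 152, fn = 21, tn = 255, fp = 19),
  NaiveBayes = class_metrics(tp = 125, fn = 48, tn = 242, fp = 32),
  LogReg     = class_metrics(tp = 135, fn = 38, tn = 249, fp = 25)
)
round(results, 3)
```

Overall accuracy is included only as an additional summary; it is derived from the same counts and is not a figure reported in the original analysis.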
Our analysis identified several key predictors for Alzheimer’s
diagnosis: Cholesterol HDL, MMSE, Functional Assessment,
Memory Complaints, Behavioral Problems, ADL and Sleep Quality.
These variables showed statistically significant associations with
Alzheimer’s diagnosis in the hypothesis tests.
This exploratory research is organized around six main hypotheses regarding the ability of different categories of variables to predict an Alzheimer’s diagnosis. These hypotheses are reviewed below in light of the individual comparative statistics and the predictive models.
H1 proposed that older age would be associated with a higher likelihood of Alzheimer’s diagnosis. The data analysis revealed that age is not statistically significant in determining whether an individual has Alzheimer’s. There is therefore insufficient support for H1, and we do not accept it.
H2 suggested that a healthier lifestyle, characterized by a lower BMI, non-smoking status, low alcohol consumption, regular physical activity, good diet quality, and better sleep quality, would be associated with a lower likelihood of Alzheimer’s diagnosis. Sleep quality differed significantly between the groups in the t-test (p = 0.0087), although it was not significant in the logistic regression (p = 0.1125); the other factors were not found to be statistically significant. Thus, some support for H2 was found and it can be partially accepted.
H3 stated that a history of chronic health conditions such as cardiovascular disease, diabetes, depression, hypertension, and head injury, as well as a family history of Alzheimer’s, would increase the likelihood of an Alzheimer’s diagnosis. None of these factors was found to be statistically significant in determining an Alzheimer’s diagnosis. Thus, H3 cannot be accepted.
H4 suggested that poor cardiovascular health, indicated by variables such as high blood pressure and unfavorable cholesterol levels (high total cholesterol, high LDL, low HDL, and high triglycerides), would be associated with a higher likelihood of Alzheimer’s diagnosis. CholesterolHDL did differ significantly between the groups (p = 0.049), although mean HDL was higher, not lower, in the Alzheimer’s group. H4 can therefore be only partially accepted.
H5 proposed that lower scores on cognitive and functional assessments and the presence of memory and behavioral complaints would result in a higher likelihood of Alzheimer’s diagnosis. The data analysis revealed that MMSE scores, FunctionalAssessment scores, MemoryComplaints, BehavioralProblems, and ADL scores were all statistically significant predictors of an Alzheimer’s diagnosis. This indicates full support for H5, and so it is accepted.
Finally, H6 suggested that the presence of cognitive and behavioral symptoms such as confusion, disorientation, and forgetfulness would be positively associated with Alzheimer’s diagnosis. Within this category, none of the variables was found to be statistically significant. Thus, H6 is not accepted.
Assessing and predicting an Alzheimer’s diagnosis has long been difficult. As demonstrated in previous literature, many of the factors that lead to or indicate Alzheimer’s can be difficult to notice. This research is therefore important: it demonstrates how data can be used to determine which factors are the best predictors of the disease, and it suggests that such models could also be used in a medical setting. That said, each of the many factors that may indicate Alzheimer’s must be tested rigorously to ensure that the proper ones are selected to improve healthcare outcomes for current and future patients.
Our analysis revealed predictors, including Cholesterol HDL, MMSE scores, Functional Assessment, Memory Complaints, Behavioral Problems, ADL scores, and Sleep Quality, that were significantly associated with Alzheimer’s diagnosis, thereby fulfilling our research objective. These findings underscore the importance of both cognitive assessments and lifestyle factors in identifying individuals at higher risk for Alzheimer’s.
Considering the objective and the methodology of the research, the results suggest that some predictors might be especially important in the prediction of Alzheimer’s. Age, for example, was originally expected to be a very significant indicator of Alzheimer’s diagnosis; however, the findings did not provide enough support to accept this hypothesis drawn from previous literature. This is likely because the dataset did not contain a wide range of ages but mostly elderly people, so there is not necessarily a contradiction with existing literature.
Regarding lifestyle factors, existing literature suggested that a healthier lifestyle would be protective against Alzheimer’s. In our analysis, sleep quality was the only lifestyle factor found to be significant, which highlights the importance of sleep quality in relation to cognitive health.
For chronic health conditions, despite the widespread belief that conditions like cardiovascular disease, diabetes, and depression increase Alzheimer’s risk, the analysis found no significant associations for these variables. This may reflect the fact that the differences between individuals with and without these conditions were not substantial enough to influence the model’s predictions.
In contrast, two categories did have significant effects on predicting Alzheimer’s diagnosis: clinical measurements and cognitive/functional assessments. Cholesterol was suggested in previous literature to have a negative effect on cognitive health. In our data, higher levels of HDL cholesterol were positively associated with Alzheimer’s diagnosis, although this association was weak (p = 0.049 in the t-test and p = 0.0716 in the logistic regression). As for the cognitive and functional assessments, these are very clearly significant in the prediction of Alzheimer’s diagnosis. This corroborates the findings of previous research, which demonstrated that people with Alzheimer’s performed far worse on similar tests than those without Alzheimer’s.
The results of our study offer several actionable insights for both clinical practice and research.
First, preventive strategies should focus on addressing modifiable risk factors that emerged as significant in our analysis. In particular, improving sleep quality should be a major focus in efforts to prevent Alzheimer’s, given the significant link between poor sleep quality and Alzheimer’s diagnosis. Healthcare providers should also take a more active role in monitoring and improving a person’s sleep quality, for example through sleep therapy aimed at improving sleep quality and patterns. Another important aspect of prevention is cholesterol management. Our study indicates that higher HDL cholesterol levels are associated with a higher risk of being diagnosed with Alzheimer’s. As such, healthcare providers should regularly monitor HDL levels and intervene when necessary.
Second, early detection is incredibly important in the case of Alzheimer’s disease, so the most reliable predictors identified in this study should be taken into account. Cognitive and functional assessments such as the MMSE are crucial here; they should remain an important part of screening protocols and be used regularly in high-risk populations. Two further significant predictors are Behavioral Problems and Memory Complaints. These symptoms are also an important part of screening for early signs of Alzheimer’s. An appropriate protocol for monitoring these predictors is essential to the early detection of Alzheimer’s, and their use could improve early diagnosis rates.
Third, patient care should be improved. A key predictor found in our study is activities of daily living (ADL). Maintaining a patient’s independence in daily tasks for as long as possible is therefore vital. Encouraging patients to develop consistent habits around daily tasks can extend their independence. Where necessary, a tailored care plan should be introduced to support patients in maintaining these activities for as long as possible.
Future research into Alzheimer’s determinants and potential prevention measures is still needed. Our research revealed some patterns and trends that can be tested further, but much remains unknown. For example, how long-term improvement in these areas affects the risk of being diagnosed with Alzheimer’s is a potential area for future research.
Alzheimer’s Association. (2024). 2024 Alzheimer’s Disease Facts and Figures. Alzheimer’s Association. https://www.alz.org/media/Documents/alzheimers-facts-and-figures.pdf
Breijyeh, Z., & Karaman, R. (2020). Comprehensive Review on Alzheimer’s Disease: Causes and Treatment. Molecules, 25(24), 5789. https://doi.org/10.3390/molecules25245789
El Kharoua, R. (2024). Alzheimer’s Disease Dataset. Kaggle.com. https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset/data?select=alzheimers_disease_data.csv
Suh, G., Ju, Y., Yeon, B. K., & Shah, A. (2004). A longitudinal study of Alzheimer’s disease: rates of cognitive and functional decline. International Journal of Geriatric Psychiatry, 19(9), 817–824. https://doi.org/10.1002/gps.1168